1) Characteristics of overfit models

1-1) Symptoms of overfitting in polynomial regression

In the last module, we talked about the potential for high complexity models to become overfit to the data. And we also discussed this idea of a bias-variance tradeoff, where high complexity models could have very low bias but high variance, whereas low complexity models have high bias but low variance.

And in this module, what we're gonna do is talk about a way to automatically balance between bias and variance using something called ridge regression.

So let's recall this issue of overfitting in the context of polynomial regression. And remember, this is our polynomial regression model. And if we assume we have some low order of polynomial that we're fitting to our data, we might get a fit that looks like the following. This is just a quadratic fit to the data. But once we get to a much higher order polynomial, we can get these really wild fits to our training observations. Again, this is an instance of a high variance model. But we refer to this model or this fit as being overfit. Because it is very well tuned to our training observations, but it doesn't generalize well to other observations we might see.

Screenshot taken from Coursera 1:00

So, previously we had discussed a very formal notion of what it means for a model to be overfit: our model's training error is less than the training error of another model, and yet that other model's true error is actually smaller than the true error of our model with the smaller training error.

Okay, hopefully you remember that from the last module. But a question we have now is, is there some type of quantitative measure that's indicative of when a model is overfit? And to see this, let's look at the following demo, where what we're going to show is that when models become overfit, the estimated coefficients of those models tend to become really large in magnitude.
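
To see this concretely, here's a minimal Python sketch of that kind of demo; the data, polynomial degrees, and noise level are all made up for illustration:

```python
import numpy as np

# Fit polynomials of increasing degree to noisy data and watch the estimated
# coefficients blow up in magnitude as the fit becomes overfit.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 30))
y = np.sin(3 * x) + rng.normal(scale=0.3, size=30)

for degree in [2, 4, 8, 16]:
    coefs = np.polyfit(x, y, degree)   # least squares polynomial fit
    print(f"degree {degree:2d}: max |coefficient| = {np.abs(coefs).max():.1f}")
```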

Screenshot taken from Coursera 2:00

1-2) Overfitting for more general multiple regression models

So, in particular, we can also face this issue of overfitting when we have lots and lots of inputs. That represents a very flexible model that can run into the same issues that we saw in our demo for polynomial regression.

Or more generally, we can say just if we have lots of features. So we'll say that capital D is very large. And this could be different functions of our input. But when you include lots and lots of these functions of our inputs, in our regression model then again we're in this place where the model has a lot of flexibility to explain the data and we're subject to becoming overfit.

But this issue of overfitting with respect to increasing model complexity is really relative to how much data that we have. So let's talk about overfitting as a function of the number of observations that we have. As well as a function of the number of inputs, or the complexity of the model.

Screenshot taken from Coursera 1:00

So in particular, if we have very few observations, so N is small, then our models can rapidly become overfit to the data. Because we have only a few points, and as we increase our model complexity, like the order of the polynomial, it becomes very easy to hit all of our observations; but in between those observations, things can go very wild.

On the other hand, if we have lots and lots of observations, even with really, really complex models, we're not gonna become overfit as quickly, because we have dense observations across our input, in this example as a function of square feet, so the function is pinned down basically everywhere. It's not able to hit every observation, and it's not able to do these really crazy wiggly things.

Screenshot taken from Coursera 2:00

On the other hand, even when we have just one input, like the number of square feet of a house, in order to avoid overfitting we need to have observations that are very dense across the number of square feet. So we need lots of representative examples of square feet and house value pairs. And this is actually pretty hard to do: to have lots of examples of houses at every possible square footage you might see.

Screenshot taken from Coursera 2:50

So this is already a hard problem, but it becomes even harder when I increase the number of inputs in my model. So, for example, just think of a model where I have square feet and number of bathrooms. And I want to cover all possible combinations of those two inputs in order to provide representative examples and avoid overfitting. Well that's really really hard.

Screenshot taken from Coursera 3:00

2) The ridge objective

2-1) Balancing fit and magnitude of coefficients

So now let's talk about a way to automatically address this issue by modifying the cost term that we're minimizing when we're assessing how good our fit is. So, in particular, we're looking at this orange box, this quality metric. Before, our quality metric just depended on the difference between our predicted house sales price and our actual house sales price; in particular, we were looking at residual sum of squares as our measure of fit. But now we're gonna modify this quality metric to also take into account a measure of the complexity of the model, in particular, in order to bias us toward simpler models.

Screenshot taken from Coursera 0:40

So when we're thinking about defining this modified cost function, what we're gonna want to do is balance between how well the function fits the data and a measure of how complex, or how potentially overfit, the model is. And what did we see was an indicator of that? The magnitude of our estimated coefficients.

So, what we're going to balance between is the fit of the model to the data and the magnitude of the coefficients of the model. Okay, so we can write down a total cost that has these two terms, where the first is our new measure of the quality of the fit. And when I say measure of fit here, what I mean is that a small number indicates a good fit to the data. On the other hand, for the measure of the magnitude of the coefficients, if that number is small, that means the coefficients are small, and we're unlikely to be in this setting of a very overfit model.

Okay, so clearly we want to balance between these two measures, because if I just optimize the magnitude of the coefficients, I'd set all the coefficients to zero and that would sure not be overfit, but it also would not fit the data well. So that would be a very high bias solution. On the other hand, if I just focused on optimizing the measure of fit, that's what we did before. That's the thing that was subject to becoming overfit in the face of complex models. So somehow we want to trade off between these two terms, and that's what we're going to discuss now.

Screenshot taken from Coursera 2:00

Okay, what's our measure of fit? It's our residual sum of squares, and hopefully this formula is quite familiar to you at this point:

$$RSS(w) = \sum_{i=1}^N (y_i - h(x_i)^Tw)^2$$

But sometimes we also write it as $\sum_{i=1}^N (y_i - \hat y_i(w))^2$, where $\hat y_i(w)$ is our predicted value for the $i$th observation, using $w$ in our model to make these predictions.

And just remember that a small residual sum of squares is indicative of a model that fits the training data well. So just as we said on the previous slide, when we're thinking about a measure of fit, a small number is gonna indicate a good fit.

Screenshot taken from Coursera 3:00

Okay, so now what we need is a measure of the magnitude of the coefficients. So what summary number might be indicative of the size of the regression coefficients?

Well maybe you think about just summing all the coefficients together? Is this gonna be a good measure of the overall magnitude of the coefficients? Probably not in a lot of cases because you might end up with a situation where, let's say,

  • $w_0$ is 1,527,301
  • $w_1$ is -1,605,253

Well, let's say $w_0$ and $w_1$ are the only two coefficients in our model. If I look at $w_0 + w_1$, I get -77,952, which is a fairly small number, despite the fact that each of the coefficients themselves was quite large.

Okay, so you might say, I know how to fix this, I'll just look at the absolute values. So maybe what I'll do is look at the absolute value of $w_0$, plus the absolute value of $w_1$, plus all the way up to the absolute value of $w_D$. And I'll just write this compactly: the sum from j = 0 to capital D, the number of features we have, of the absolute value of $w_j$. And this is defined to be equal to what's called the one norm of the vector of coefficients w. So we write it as the vector w, I'll try and make this a thick font here, sub 1, and this is called the L1 norm. And this is actually a very reasonable choice, which we're gonna discuss more in the next module.

$$L_1 \text{ norm: } |w_0| + |w_1| + \dots + |w_D| = \sum_{j=0}^D |w_j| \triangleq \|w\|_1$$

But for now, what we're gonna consider is the sum of the squares of the coefficients. So $w_0$ squared plus $w_1$ squared, all the way up to $w_D$ squared. So this is the sum from j equals zero to capital D of $w_j$ squared. And this is defined to be equal to, and we've actually seen this norm many times in this class so far, the two norm squared. So this is called our L2 norm, or really the L2 norm squared, and this is gonna be the focus of this module.

$$L_2 \text{ norm: } w_0^2 + w_1^2 + \dots + w_D^2 = \sum_{j=0}^D w_j^2 \triangleq \|w\|_2^2$$

Discussion: Understand the 2 norms: https://www.coursera.org/learn/ml-regression/module/Uv1l0/discussions/l1Bx052JEeWJ-hJkaqoaKQ

The L2 norm is Euclidean distance. It's just the same as what we normally think of as the distance between two points -- that's the length of the vector from one point to the other, in ordinary, flat, Euclidean space. For instance, the distance of a point (x,y,z) from the origin, in our familiar 3-dimensional physical space, is $\sqrt{x^2 + y^2 + z^2}$. If there are D numerical features in a dataset, those form a D-dimensional vector space, and it can have the same sort of L2 distance defined -- the square root of the sum of squares of the features.

L1 norm uses absolute value in place of squaring to make the components positive so they don't just cancel each other out. It seems like that should be simpler than squaring, but it's not. Absolute value has a discontinuity in its slope at the origin, so it has no derivative there. So if our objective function uses absolute value, we can't take the gradient. If we use the L2 norm, or rather, the square of the L2 norm, then the gradient is fine -- no problems at the origin.
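
As a quick numeric sanity check of these definitions, here's a small Python sketch; the coefficient values are the ones from the hypothetical example above:

```python
import numpy as np

w = np.array([1527301.0, -1605253.0])

print(w.sum())                  # plain sum: -77952.0, misleadingly small
print(np.linalg.norm(w, 1))     # L1 norm: |w_0| + |w_1| = 3132554.0
print(np.linalg.norm(w) ** 2)   # squared L2 norm: w_0^2 + w_1^2
```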

Screenshot taken from Coursera 6:00

So again, what we have, just to summarize, is we have our total cost is a sum of the measure of fit + a measure of the magnitude of coefficients and we said our measure of fit is our residual sum of squares. And our measure of the magnitude of the coefficients for this module is going to be this two norm of the w vector squared.

Screenshot taken from Coursera 6:50

2-2) The resulting ridge objective and its extreme solutions

Okay, so let's consider the resulting objective, where I'm gonna try and search over all possible w vectors. To find the ones that minimize the sum of residual sum of squares plus the square of the two norm of w. So that's gonna be my $\hat w$, my estimated model parameters.

But really what I'd like to do is control how much I'm weighing the complexity of the model, as measured by this magnitude of my coefficients, relative to the fit of the model. I'd like to balance between these two terms, and so I'm gonna introduce another parameter, called a tuning parameter, which we'll denote $\lambda$ (lambda). It balances between this fit and magnitude.

  • If $\lambda = 0$: So let's see what happens if I choose lambda to be 0. Well, if I choose lambda to be 0, this magnitude term that we've introduced completely disappears, and my objective reduces down to just minimizing the residual sum of squares, which was exactly my objective before. So this reduces to minimizing residual sum of squares of w as before, and this is our old solution, which leads to some w hat that I'm gonna call w hat superscript LS, for least squares. Because what we were doing before is commonly referred to as the least squares solution, I'm gonna specifically represent the parameters associated with that old procedure as the least squares parameters.

  • If $\lambda = \infty$: On the other hand, what if I completely crank up that tuning parameter to be infinity? So I have a really, really massively large weight on this magnitude term. Massively large being infinitely large. So as large as you can possibly imagine.

    • So what happens to any solution where w hat is not equal to 0? For solutions where w hat does not equal 0, what is the total cost? Well, I get something that's non-0 times infinity, plus my residual sum of squares, whatever that happens to be. The sum of that is infinity. Okay, so my total cost is infinite.
    • On the other hand, what if w hat is exactly equal to 0? Then the total cost is equal to the residual sum of squares of this 0 vector. And that's some number, but it's not infinity. So the minimizing solution here is always gonna be w hat equals 0, cuz that's the thing that's gonna minimize the total cost over all possible w's.

Okay, so just to recap, we said that if we put that tuning parameter all the way down to 0, we return to our previous least squares solution. And if we crank that parameter all the way up to be infinite, in that limit, we get all of our coefficients being exactly 0, okay?

  • If $0 < \lambda < \infty$: But we're gonna be operating in a regime where lambda is somewhere in between 0 and infinity. And in this case, we know that the magnitude of our estimated coefficients is gonna be less than or equal to the magnitude of our least squares coefficients; in particular, the two norm will be less than or equal to that of the least squares solution. But we also know it's gonna be greater than or equal to 0. So we're gonna be somewhere in between these two regimes.
    • And a key question is, what lambda do we actually want? How much do we want to bias away from our least square solution, which was subject to potentially over-fitting, down to this really simple, the most trivial model you can consider which is nothing, no coefficients in the model. What's the model if all the coefficients are 0? Just noise, we just have y equals epsilon, that noise term. Okay, so we're gonna think about somehow trading off between these two extremes.

Screenshot taken from Coursera 5:00

Okay, I wanted to mention that this is referred to as Ridge regression. And that's also known as doing L2 regularization. Because, for reasons that we'll describe a little bit more later in this module, we're regularizing the solution to the old objective that we had, using this L2 norm term.

Screenshot taken from Coursera 5:50

2-3) How ridge regression balances bias and variance

Let's talk about this in the context of the bias variance trade-off.

And what we saw is when we had very large lambda, we had a solution with very high bias, but low variance. And one way to see this is thinking about when we're cranking lambda all the way up to infinity, in that limit, we get coefficients shrunk to be zero, and clearly that's a model with high bias but low variance. It's completely low variance, it doesn't change no matter what data you give me.

On the other hand, when we had very small lambda, we have a model that is low bias, but high variance. And to see this, think about setting lambda to zero, in which case we get out just our old solution, our old least squares, minimizing-residual-sum-of-squares fit. And there we see that for higher complexity models, clearly you're gonna have low bias but high variance.

So what we see is this lambda tuning parameter controls our model complexity and controls this bias variance trade-off.

Screenshot taken from Coursera 1:00

Let's return to our polynomial regression demo, but now using ridge regression and see if we can ameliorate the issues of over-fitting as we vary the choice of lambda. And so we're going to explore this ridge regression solution for a couple different choices of this lambda tuning parameter.

Screenshot taken from Coursera 1:20

2-4) The ridge coefficient path

Well, we've motivated analytically how the coefficients that we get when solving this ridge regression problem are gonna change for different settings of lambda. Specifically, we saw that when lambda was 0, we get our least square solution. When lambda goes to infinity, we get very, very small coefficients approaching 0. And in between, we get some other set of coefficients and then we explore this experimentally in this polynomial regression demo.

But one thing that's interesting to draw is what's called the coefficient path for ridge regression, which shows, as you vary lambda all the way from 0 up towards infinity, how the coefficients change. So how does my solution change as a function of lambda? And what we're doing in this plot here is we're drawing this for our housing example, where we have eight different features: number of bedrooms, bathrooms, square feet of the living space, number of square feet of the lot size, number of floors, the year the house was built, the year the house was renovated, and whether or not the property is waterfront. Each of these inputs is a different feature, and for each one we're drawing its coefficient value, for example the coefficient value for square feet living, for a specific choice of lambda, and how that coefficient varies as I increase lambda. And I'm showing this for each one of the eight different coefficients.

And I just want to briefly mention that in this figure, we've rescaled the features so that they all have unit norm. That's why all of these coefficients are roughly on the same scale, roughly the same order of magnitude. Okay, and so what we see in this plot is, as lambda goes towards 0, or when it's specifically at 0, the value of each of these coefficients, each of these circles touching this line, is gonna be my w hat least squares solution. And as I increase lambda out towards infinity, I see that my solution, w hat, approaches 0; the vector of coefficients is going to 0. We haven't made lambda large enough in this plot to see them get really, really close to 0, but you see the trend happening here.

And then there's some sweet spot in this plot, which we're gonna talk about later in this module. So this is gonna represent some $\lambda ^* $ (lambda star), which will be the value of lambda that we wanna use when we're selecting our specific regularized model to use for forming predictions. And we're gonna discuss how we choose which lambda to use later in the module. But for now, the main point of this plot is to realize that for every value of lambda, every slice of this plot, we get a different solution, a different w hat vector.
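
To make this concrete, here's a sketch of how such a coefficient path can be traced, assuming scikit-learn is available; the data and the eight "features" are synthetic stand-ins for the housing inputs in the figure:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))      # 8 synthetic features (sqft, bedrooms, ...)
y = X @ np.array([3.0, -2.0, 1.5, 0.5, 0.0, 1.0, -0.5, 2.0]) + rng.normal(size=100)

lambdas = np.logspace(-2, 5, 30)
path = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])
# path has shape (30, 8): each column traces one coefficient, from roughly the
# least squares value (small lambda) down toward 0 (large lambda).
```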

Screenshot taken from Coursera 3:00

3) Optimizing the ridge objective

3-1) Computing the gradient of the ridge objective

For now, let's assume that somebody gives you the lambda value that we wanna use in our ridge regression objective and let's discuss algorithmically how we fit this model.

So in particular, in this part, we're describing this gray box our machine learning algorithm.

Screenshot taken from Coursera 0:20

3-1-1) Step 1: Rewrite total cost in matrix notation

And to start with, just like we've done many times before, we're first gonna rewrite our objective in matrix notation, where we assume we have some N observations and we wanna write jointly what the objective is for all N of these observations.

So let's just recall what we did for our residual sum of squares term. Where, for our model, we thought about stacking up all N observations and all the features associated with those N observations in this green matrix. And that matrix got multiplied by a vector of our regression coefficient, and then we had this additive noise per observation. So we wrote this matrix notation as $$y = Hw + \epsilon$$

Screenshot taken from Coursera 1:10

Then, when we went to form our residual sum of squares, we showed that it was equivalent to the following where we have $$ (y - Hw)^T(y - Hw)$$

Okay, so this is our matrix notation for our residual sum of squares term. But now we wanna write a similar term for the model complexity penalty that we added to our original objective to get our resulting ridge regression objective.

Screenshot taken from Coursera 1:30

So in particular, we want to write this two norm of w squared in vector notation. So this two norm of our w vector squared, we said, was equal to $w_0$ squared + $w_1$ squared + $w_2$ squared + all the way up to our Dth coefficient squared. And this is equivalent to taking our w vector, transposed, meaning putting it as a row, and multiplying by the w vector itself.

Because if we think about doing this multiplication, we're gonna get $w_0 * w_0 + w_1 * w_1 + w_2 * w_2$, etc., and that's exactly equivalent. And so we can write this as $w^Tw$. Okay, so this is our vector notation for this model complexity term.

Screenshot taken from Coursera 2:30

So putting it all together, our ridge regression total cost, for some N observations, can be written as follows, where we have $$RSS(w) + \lambda||w||_2^2 = (y - Hw)^T(y - Hw) + \lambda w^Tw$$

Okay, now that we have this, we can start doing what we've done in the past which is take the gradient and we can think about either setting the gradient to zero to get a closed form solution, or doing our gradient descent algorithm. And we're gonna walk through both of these approaches now.

Screenshot taken from Coursera 3:00

3-1-2) Step 2: Compute the gradient

So the first step is computing the gradient of this objective.

So here I'm just writing exactly what we had on the previous slide. But, now, with these gradient signs in front, and as we've seen before, the gradient distributes across a sum.

So we have the gradient of this first term plus the gradient of our model complexity term, where the first term is our measure of fit, that residual sum of squares term. And we know that the gradient of the residual sum of squares has the following form. $$ - 2H^T(y - Hw)$$

The question is, what's the gradient of this model complexity term? What we see is that the gradient of this is $2w$. Why is it $2w$? Well, instead of deriving it, I'll leave that for a little mini challenge for you guys. It's fairly straightforward to derive, just taking partials with respect to each w component.

Just write $w^Tw$ as $w_0^2 + w_1^2 + ... + w_D^2$. Then take a derivative just with respect to one of the Ws. But for now what I'm gonna do is I'm just gonna draw an analogy to the 1d case where w transpose w is analogous to just w squared. If w weren't a vector and were instead just a scalar. And what's the derivative of w squared? It's 2w. Okay, so proof by analogy here.
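
If you want to check the mini challenge, the componentwise derivation goes as follows:

$$\frac{\partial}{\partial w_k}\, w^Tw = \frac{\partial}{\partial w_k} \sum_{j=0}^D w_j^2 = 2w_k \quad \Rightarrow \quad \nabla_w \left( w^Tw \right) = 2w$$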

Screenshot taken from Coursera 5:00

3-2) Approach 1: closed-form solution

Step 3: Approach 1: Set the gradient = 0

Okay, so our first approach is just gonna be to set our gradient = 0 and solve for w.

But before we get there, let's just do a little linear algebra review of what identity matrices are. So the identity matrix is just the matrix analog of the number 1, and it can be defined in any dimension. Here we show the scalar case and a 2 by 2 matrix, and what we see is that the identity matrix just places 1's along the diagonal and 0's off the diagonal. And that's true in any dimension, up to having an N by N matrix: we have N 1's on the diagonal, and every other term in the matrix is 0.

So let's discuss a few fun facts about the identity matrix.

  • If you take the identity matrix and multiply it by a vector v, where the identity matrix is N by N and v is an N by 1 vector, you're just gonna get the vector v back. $$Iv = v$$
  • If you multiply this identity matrix by another matrix A, where A is some N by M matrix, you're just gonna get that A matrix back. $$IA = A$$
  • Then we can talk about a matrix inverse. In this case, we're talking about a square matrix, so $A^{-1}$ and A are both N by N matrices. If we take $A^{-1}A$, then the result is the identity matrix. This inverse is like the matrix equivalent of division: if you think of dividing a scalar a by a, you get the number 1, and this is the matrix analog of that. $$A^{-1}A = I$$
  • If you multiply A by A inverse, you also get the identity. $$AA^{-1} = I$$

These are just some fun facts about the identity matrix, as well as inverses, that are gonna be useful in this module and probably in other modules later on, as well.

And what we're gonna do now, now that we understand this identity matrix, is simply rewrite the gradient of the total cost using the identity matrix. So this is exactly the same; all we've done is replace w, this w vector, by the identity times w. So these are equivalent, but this is gonna be helpful in our next derivation.

Screenshot taken from Coursera 3:00

Okay, so now we can take this equivalent form of the gradient of our total cost, and set it equal to zero. $$\nabla \text{cost}(w) = -2H^T(y-Hw) + 2\lambda Iw = 0$$

  • First thing we can do is just divide both sides by two to get rid of those twos. And when we're setting this equal to zero I'm gonna put the hat on the w, because that's what we're solving for $$ -H^Ty + H^TH\hat w + \lambda I\hat w = 0$$
  • Then I can bring this to the other side $$ H^TH\hat w + \lambda I\hat w = H^Ty$$
  • And then what I see is I have w hat appearing in both of these terms. So I can factor it out. $$ (H^TH + \lambda I)\hat w = H^Ty$$
  • So this is the step where having that identity matrix was useful. So I hope it was worth everything on the last slide to get that one little punch line. If we use our inverse from the previous slide, and pre-multiply both sides by $(H^TH + \lambda I)^{-1}$, then I get: $$ \hat w = (H^TH + \lambda I)^{-1}H^Ty$$

And in particular, I'm gonna call this w hat ridge to indicate that it's the ridge regression solution for a specific value of lambda.

Screenshot taken from Coursera 5:00

3-3) Discussing the closed-form solution

Now let's think a little bit about the solution. $$\hat w^{ridge}= (H^TH + \lambda I)^{-1}H^Ty$$

  • If $\lambda = 0$: $\hat w^{ridge}= (H^TH)^{-1}H^Ty = \hat w^{LS}$. And that might look very familiar to you. That was exactly equal to our w hat least squares, our old solution, before we introduced this notion of ridge regression.
  • If $\lambda = \infty$: $\hat w^{ridge}= 0$. Because it's like dividing by infinity: we have this infinitely large term appearing inside the inverse. Remember, the inverse was like our matrix analog of division, so that's the intuition for why w hat ridge is exactly equal to zero.

So this is a little sanity check. That when lambda is equal to zero, this closed form solution we have is exactly equal to our least square solution. That's what we had discussed at the very beginning of this module. And likewise, when we crank lambda all the way up to infinity, our solution is equal to zero. But now what we have is we have a closed form for what the solution is for some lambda in between zero and infinity.
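
Here's a minimal numpy sketch of this closed-form solution, with the two sanity checks above; the data is synthetic, and solving the linear system is used in place of forming the inverse explicitly:

```python
import numpy as np

def ridge_closed_form(H, y, lam):
    """Solve (H^T H + lambda*I) w = H^T y for the ridge coefficients."""
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)

rng = np.random.default_rng(2)
H = rng.normal(size=(50, 5))
y = H @ rng.normal(size=5) + rng.normal(scale=0.1, size=50)

w_ls = np.linalg.lstsq(H, y, rcond=None)[0]              # least squares solution
print(np.allclose(ridge_closed_form(H, y, 0.0), w_ls))   # lambda = 0: True
print(np.linalg.norm(ridge_closed_form(H, y, 1e12)))     # huge lambda: ~0
```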

Screenshot taken from Coursera 1:00

Let's also recall the discussion we had about our previous solution, w hat least squares, where we said this $H^TH$ looks exactly like the big matrix in the picture.

The number of rows is equivalent to the number of observations, that's N and the number of columns is equal to the number of features, which we denote as D.

And what we said in the module before is that $H^TH$, the multiplication of these two matrices, is invertible in general if the number of observations is greater than the number of features; really, it's the number of linearly independent observations that must be greater than the number of features. And we said the complexity of this inverse is cubic in the number of features D.

Screenshot taken from Coursera 3:00

Well now let's think about similar properties, but of our ridge regression solution.

Where this $H^TH$ is exactly like we had it before. But now, before we take our inverse, we're adding lambda times the identity matrix. And when you take a scalar and multiply it by the identity matrix, you just get that value along the diagonal, so we get a whole bunch of lambdas along the diagonal and zeros everywhere else in this matrix.

So what ends up happening now is that the result, $H^TH + \lambda I$, is always invertible when lambda is greater than zero, even if the number of observations, or the number of linearly independent observations, is less than the number of features. So it is invertible always if $\lambda > 0$, even if N < D. This is really important when you have lots of features. For large D, lots of features, and remember, that's how we motivated using ridge regression: we're in these really complicated models with lots and lots of features, a lot of flexibility, and the potential to overfit. Now we see something very explicit about how ridge regression helps us.
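
A tiny numpy check of this invertibility claim, with made-up dimensions:

```python
import numpy as np

# With fewer observations than features (N < D), H^T H is rank deficient and
# not invertible, but H^T H + lambda*I is full rank for any lambda > 0.
rng = np.random.default_rng(3)
H = rng.normal(size=(5, 10))                              # N = 5, D = 10
gram = H.T @ H

print(np.linalg.matrix_rank(gram))                        # at most 5 < 10: singular
print(np.linalg.matrix_rank(gram + 0.1 * np.eye(10)))     # 10: invertible
```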

And just to return to the discussion on the naming of ridge regression being called a regularization technique: if you remember, I said that we're regularizing our standard least squares solution. Well, we can see that here, because $\lambda I$ is making $H^TH + \lambda I$ more regular. That's what's allowing us to do this inverse even in this other situation, this harder situation, and because this result is more regular, we call it regularized.

But the complexity of the inverse is still cubic in the number of features we have. And often, when we're thinking about ridge regression, like I said, we're thinking about cases where you have lots and lots of features, so computing the closed-form solution that we've shown here can be computationally prohibitive.

Screenshot taken from Coursera 5:00

3-4) Approach 2: gradient descent

Step 3: Approach 2: gradient descent

So instead of this closed form solution, which can be computationally prohibitive, let's think about using our gradient descent algorithm to solve our ridge regression problem.

And let's just talk about the element wise update that we have. So it's the update to the jth feature weight.

And remember, here I'm showing the gradient of our total cost, just as a reminder. $$\nabla \text{cost}(w) = -2H^T(y-Hw) + 2\lambda w $$

And what we do is we take our previous weight for the jth feature and subtract eta, our step size, times whatever this gradient dictates, to form our new weight. $$w_j^{(t+1)} \leftarrow w_j^{(t)} - \eta \left[ -2 \sum_{i=1}^N h_j(x_i)(y_i - \hat y_i(w^{(t)})) + 2 \lambda w_j^{(t)} \right]$$

And what does this gradient tell us? Well, this first term here is exactly what we had before; it comes from our residual sum of squares term.

But now, in addition to this, we have another term appearing: $2 \lambda w_j^{(t)}$. So this is our new term, which comes from the jth component of $2 \lambda w$.

Okay, so one thing we see is we have a $w_j^{(t)}$ appearing in two places. So we can combine these and simplify the form of this expression to get a little bit more interpretability.

Screenshot taken from Coursera 1:00

So what we see is that our update now becomes: $$w_j^{(t+1)} \leftarrow (1 - 2 \eta \lambda)w_j^{(t)} + 2 \eta \sum_{i=1}^N h_j(x_i)(y_i - \hat y_i(w^{(t)}))$$

  • $\eta$: our step size, which is greater than 0
  • $\lambda$: our tuning parameter, which is greater than 0
  • So the result of $2 \eta \lambda$ is a positive number, and we have (1 - positive number). We know that $(1 - 2 \eta \lambda)$ is going to be less than 1. But we also want the result to be positive, so to enforce that, we're going to require that $2 \eta \lambda$ be less than 1. Okay, so it's a positive quantity that is less than 1.

Okay, so what we're doing is we're taking our previous coefficient $w_j^{(t)}$ and we're multiplying by some number that's less than 1, a positive number less than 1, so what that's doing is it's just shrinking the value by some amount, at some fraction.

And then we have this update here, $2 \eta \sum_{i=1}^N h_j(x_i)(y_i - \hat y_i(w^{(t)}))$, which is the same update that we had from our residual sum of squares before, so I'll just call this our update term from residual sum of squares.

Discussion:

And so let's draw a picture of what's happening here. And what I'm gonna do is I'm gonna draw a picture that's a function of iterations going this way. And magnitude of the coefficients along the y axis.

So, we're gonna say that at some iteration T, we have a coefficient $w_j^{(t)}$. And what we're doing in ridge regression is we're introducing this intermediate step where here we first shrink the value of the coefficient, according to $(1 - 2 \eta \lambda)w_j^{(t)}$. And so, just to be very clear this is an intermediate step introduced in ridge regression.

This happens in between iteration T and when we get to iteration T + 1. What we do is we take whatever this update term is, and it could be positive or negative, and we modify this shrunken coefficient. So let's just assume for this illustration that it's positive, some positive value. So I'm going to take the last value of my shrunken coefficient.

And then I'm gonna add some amount that makes it, let's say, a little bit less than what my coefficient was before. So I'm gonna add this little update term from RSS, and this will give me $w_j$ at iteration T+1. And that contrasts with what our old standard least squares algorithm did.

So previously, when we only had RSS, which was our least squares solution, we had $w_j^{(t)}$, so we started at the same point. But instead of shrinking the coefficient with this intermediate step, we just took this little update term from residual sum of squares and modified the coefficient directly. So I'm gonna map over to iteration T + 1 and then add the same amount, the same update term. So again, this is the same update term that I'm adding to $w_j^{(t)}$, but this is gonna give me my $w_j^{(t + 1)}$ according to the least squares algorithm.

Okay, so just to be clear, what's going on here, in case this picture was not helpful to you, is that in the ridge regression case we're taking the previous value of our jth regression coefficient, and the first thing we're doing, before we consider what our fit term says we should do to it, is we shrink it. At every iteration, we first take the last value and make it smaller. And that's where the model complexity term is coming in, where we're saying we want smaller coefficients to avoid overfitting. But then after that, we're gonna say, well, okay, let's listen to how well my model's fitting the data, that's where this update term comes in, and we're going to modify our shrunken coefficient.

And in contrast, our old solution never did that shrinking first, it just said, okay, let's just listen to the fit, and listen to the fit, and listen to the fit at every iteration and that's where we could get into issues of overfitting.

Screenshot taken from Coursera 7:00

Okay, so here's a summary of our standard least squares gradient descent algorithm. This is what we had two modules ago, where we just initialize our weights vector, and while not converged, for every feature dimension, we're going to go through and compute this partial derivative of the residual sum of squares. That's that update term I was talking about on the last slide. And then I'm going to use that to modify my previous coefficient according to the step size eta. And I'm gonna keep iterating until convergence. Okay, so this is what you coded up before.

Screenshot taken from Coursera 8:00

Now let's look at how we have to modify this code to get a ridge regression variant. And notice that the only change is in the line for updating $w_j$, where, instead of taking $w_j^{(t)}$ and doing this update with what I'm calling partial here, we're taking $(1 - 2\eta \lambda)w_j^{(t)}$ and doing the update. So it's a trivial, trivial modification to your code. And that's really cool. It's very simple to implement ridge regression given a specific lambda value.
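
Here's a minimal Python sketch of that modified loop; the step size, tolerance, and iteration cap are made-up defaults, and this is not the course's exact code:

```python
import numpy as np

def ridge_gradient_descent(H, y, lam, eta=1e-4, tol=1e-6, max_iter=100000):
    """Gradient descent for ridge regression (a sketch)."""
    w = np.zeros(H.shape[1])
    for _ in range(max_iter):
        grad = -2 * H.T @ (y - H @ w) + 2 * lam * w   # gradient of the total cost
        if np.linalg.norm(grad) < tol:                # converged
            break
        # Vectorized form of the per-coordinate update:
        # w_j <- (1 - 2*eta*lam) * w_j + 2*eta * (update term from RSS)
        w = w - eta * grad
    return w
```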

Screenshot taken from Coursera 9:00

4) Tying up the loose ends

4-1) Selecting tuning parameters via cross validation

Now let's turn to this important question of how to choose the lambda tuning parameter.

And what we mentioned last module was that if we have some tuning parameter that controls model complexity, then we can think for every value of that tuning parameter, we can fit our model on our training data. Then we can assess the performance of that fitted model on a validation set, and we can tabulate this for all values of lambda that we might consider. And choose the specific model complexity according to the error on this validation set, and then assess the performance of the selected model on our test set.

Well, now what we've seen is that ridge regression is a special case of an objective where there's a tuning parameter, lambda, that's controlling model complexity. And we'd like to see how we can select this tuning parameter. So, of course, we can use exactly this procedure that we described last module, and that's assuming that we have sufficient data to do this training/validation/test split. But now let's ask the question: what if we don't have enough data to reasonably divide into these three different sets? What can you do? And we're gonna talk about this in the context of ridge regression, but again, this holds for any tuning parameter lambda that you might have in selecting between different model complexities, or any other tuning parameter controlling your model.

Screenshot taken from Coursera 1:00

We're assuming that we're starting with a smallish dataset.

Screenshot taken from Coursera 1:00

And as always, we need to break off some test dataset that we're gonna hide.

Screenshot taken from Coursera 1:00

We always need to have some test set that's never touched during training or validation of our model. So now we took this smallish dataset, and we have an even smaller dataset to do our training and model validation. So how are we gonna do this? Well, we wanna do this in some way that's a bit smarter than the naive approach of just forming our validation set.

Screenshot taken from Coursera 2:00

So let's just remember this naive approach, where we took whatever data was remaining after we split off our test set and we defined some validation set.

And a question is, in this case, when we have just a small amount of data, so that this validation set necessarily has just a small number of observations, is this sufficient for comparing between different model complexities and assessing which one is best? Well, no, clearly the answer is no. We're saying that it's just a small set. It's not gonna be representative of the space of things that we might see out there.

Screenshot taken from Coursera 2:50

Well, did we have to use the last set of tabulated observations as the observations to define this validation set? No, I could have used the first few observations, or the next set of observations, or any random subset of observations in this dataset.

Screenshot taken from Coursera 3:00

And a question is, which subset of observations should I use as my validation set? And the answer, and this is the key insight, is use all of the data subsets. Because if you're doing that, then you can think about averaging your performance across these validation sets. And avoid any sensitivity you might have to one specific choice of validation set that might give some strange numbers because it just has a few observations. It doesn't give a good assessment of comparison between different model complexities.

Screenshot taken from Coursera 3:40

4-2) K-fold cross validation

How are we gonna use all of our data as our validation set? We're gonna use something called K-fold cross validation. The first step is just a preprocessing step, where we're gonna take our data and divide it into K different blocks. We have N total observations, so every block of data is gonna have N over K observations, and these observations are randomly assigned to each block. Okay, so this is really key: we're taking our tabulated data, and in this image, even though it looks like it might just be parceling out a table of data, the data in each one of these blocks is randomly assigned. And for all steps of the algorithm that I'm gonna describe now, we're gonna use exactly the same data split, so exactly the same assignment of observations to each one of these different blocks.

Screenshot taken from Coursera 0:30

Okay, then for each one of these K different blocks, we're gonna cycle through treating each block as the validation set. And using all the remaining observations to fit the model for every value of lambda.

So in particular, we're gonna start by saying for a specific value of lambda and we're gonna do a procedure for each of the K blocks and at the end we're gonna cycle through all values of lambda. So for right now, assume that we're looking at a specific lambda value out of a set of possible values we might look at. And now we're gonna cycle through each one of our blocks where at the first iteration we're gonna fit our model using all the remaining data. That's gonna produce something that I'm calling $\hat w_{\lambda} $, so indexed by this lambda that we're looking at. So we're considering the first block as our validation set.

Then we're gonna take that fitted model, and we're gonna assess its performance on this validation set. That's gonna result in some error, which I'm calling error sub one, meaning the error on the first block of data for this value of lambda.

Okay, so I'm gonna keep track, of the error for the value of lambda for each block, and then I'm gonna do this for every value of lambda.

Screenshot taken from Coursera 2:00

I'm gonna move on to the next block, treat that as my validation set, fit the model on all the remaining data, compute the error of that fitted model on the next block of data.

Screenshot taken from Coursera 2:30

And at the end, I've tabulated my error across each of these K different blocks for this value of lambda. And what I'm gonna do is compute what's called the cross validation error of lambda, which is simply an average of the error that I had on each of the K different blocks. So now I explicitly see how my measure of error, my summary of error for the specific value of lambda, uses all of the data. It's an average across the validation sets in each of the different blocks.
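
Written out, the cross validation error for a given value of lambda is just the average of the per-block errors:

$$CV(\lambda) = \frac{1}{K} \sum_{k=1}^K \text{error}_k(\lambda)$$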

Screenshot taken from Coursera 3:00

Then, I'm gonna repeat this procedure for every value that I'm considering of lambda and I'm gonna choose the lambda that minimizes this cross validation error.

Screenshot taken from Coursera 3:20

So I had to divide my data into K different blocks in order to run this K-fold cross validation algorithm. So a natural question is, what value of K should I use?

Well you can show that the best approximation to the generalization error of the model is given when you take K to be equal to N. And what that means is that every block has just one observation. So this is called leave-one-out cross validation. So although it has the best approximation of what you're trying to estimate, it tends to be very computationally intensive, because what do we have to do for every value of lambda? We have to do N fits of our model. And if N is even reasonably large, and if it's complicated to fit our model each time, that can be quite intensive.

So, instead what people tend to do is use K = 5 or 10, this is called 5-fold or 10-fold cross validation.
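
Here's a sketch of the whole procedure in Python, using the closed-form ridge solution from earlier; the function name and defaults are made up:

```python
import numpy as np

def kfold_cv_error(H, y, lam, K=5, seed=0):
    """Average ridge validation error across K folds (a sketch)."""
    N, D = H.shape
    idx = np.random.default_rng(seed).permutation(N)    # random assignment to blocks
    blocks = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = blocks[k]                                 # block k is the validation set
        train = np.concatenate([blocks[j] for j in range(K) if j != k])
        w = np.linalg.solve(H[train].T @ H[train] + lam * np.eye(D),
                            H[train].T @ y[train])      # fit on the remaining data
        errors.append(np.mean((y[val] - H[val] @ w) ** 2))
    return np.mean(errors)                              # average over the K blocks

# Choosing lambda then amounts to:
# best_lam = min(lambdas, key=lambda lam: kfold_cv_error(H, y, lam))
```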

This summarizes our cross validation algorithm, which is a really, really important algorithm for choosing tuning parameters. And even though we discussed this option of forming a training, validation, and test set, typically you're in a situation where you don't have enough data to form each one of those. Or at least you don't know if you have enough data to accurately approximate generalization error while also assessing the difference between different models. So typically what people do is cross validation: they hold out some test set, and then they do either leave-one-out, 5-fold, or 10-fold cross validation to choose their tuning parameter lambda. And choosing these tuning parameters is a really critical step in the machine learning workflow, in order to select a model and use it for the predictions or various tasks that you're interested in.

Screenshot taken from Coursera 4:00

4-3) How to handle the intercept

Well we discussed ridge regression and cross-validation. But we kinda brushed under the rug what can be a fairly important issue when we discussed our ridge regression objective, which is how to deal with the intercept term that's commonly included in most models.

So in particular, let's recall our multiple regression model, which is shown here. And so far, we've just treated generically that there's some $h_0(x)$, our first feature, with coefficient $w_0$. But as we mentioned two modules ago, typically that first feature is taken to be what's called the constant feature, so that $w_0$ just represents the intercept of the model. So if you're thinking of the fitted hyperplane, $w_0$ determines where it sits along the y-axis. And then all the other features are some arbitrary set of other terms that you might be interested in.

Screenshot taken from Coursera 1:00

Okay. Well if we have this constant feature in our model, then the model that I wrote on the previous slide simplifies to the following.

Where in this case, when we think of our matrix notation for having N different observations, when we're forming our H matrix, the first column of that matrix multiplies the $w_0$ coefficient. So in this special case, that entire first column is filled entirely with ones, so that $w_0$ enters the same way, as the first feature, for every observation. Okay, so this is the specific form that our H matrix is gonna take in this case where we have an intercept term in the model.

Screenshot taken from Coursera 1:30

Now let's return to our standard ridge regression objective, where we said we have $$RSS(w) + \lambda||w||_2^2$$ and where that $||w||_2^2$ penalty included $w_0$, the coefficient representing the intercept term in the model.

So a question is does this really make sense to do? Because what this is doing is it's encouraging that intercept term to be small. That's what this ridge regression penalty is doing. And do we want a small intercept? So it's useful to think about doing ridge regression when you're adding lots and lots of features but regardless of how many features you add to your model, does that really matter in how we're thinking about the magnitude of the intercept? Not really. So it probably doesn't make a lot of sense intuitively to think about shrinking the intercept just because we have this very flexible model with lots of other features. So let's think about how to address this.

Screenshot taken from Coursera 2:30

The first option we have is not to penalize the intercept term.

And the way we can do that is to separate out that $w_0$ coefficient from all the other w's, $w_1$, $w_2$, all the way up to $w_D$, when we're thinking about that penalty term. So we have residual sum of squares of $w_0$ and what I'll call $w_{rest}$, all those other w's. And when we add our ridge regression penalty, the 2 norm is only taken of that $w_{rest}$ vector, all those w's not including our intercept.

So a question is, how do we implement this in practice? How is this gonna modify the closed form solution or the gradient descent algorithm that we showed previously when we weren't handling this specific case.

Screenshot taken from Coursera 3:00

So the very simple modification we can make is to define something that I'm calling $I^{mod}$. It's a modified identity matrix that has a 0 in the first diagonal entry, the (1,1) entry, and all the other elements exactly the same as the identity matrix before.

So to be explicit, our $H^TH$ term is gonna look just as it did before, but now this $\lambda I^{mod}$ has a 0 in the entry corresponding to the $w_0$ index.
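
A small numpy sketch of this modified identity; the dimensions here are hypothetical:

```python
import numpy as np

D = 4                   # hypothetical number of features, including the constant
I_mod = np.eye(D)
I_mod[0, 0] = 0.0       # don't penalize the intercept coefficient w_0

# The closed-form solution then becomes:
# w_hat = np.linalg.solve(H.T @ H + lam * I_mod, H.T @ y)
```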

Screenshot taken from Coursera 4:00

Now let's look at our gradient descent algorithm.

And here it's gonna be very simple: we just add in a special case, so that if we're updating our intercept term, that zeroth feature, we just use our old least squares update, with no shrinkage applied to $w_0$. But otherwise, for all other features, we do the ridge update.

Okay so we see algorithmically its very straightforward to make this modification where we don't want to penalize that intercept term.

Screenshot taken from Coursera 5:00

But there's another option we have which is to transform the data.

So in particular, if we center the data about 0 as a pre-processing step, then it doesn't matter so much that we're shrinking the intercept towards 0 and not correcting for that, because when the data is centered about 0, we generally expect the intercept to be pretty small. So here what I'm saying is, step one, first we transform all our y observations to have mean 0. And then, as a second step, we just run exactly the ridge regression we described at the beginning of this module, where we don't account for the fact that there's an intercept term at all. So that's another perfectly reasonable solution to this problem.

Discussion

Understand center data first approach: https://www.coursera.org/learn/ml-regression/lecture/3KZiN/how-to-handle-the-intercept/discussions/QQ5Pep8TEeWILRIOm1V0SQ

  • Suppose you have a y vector with values [1, 2, 3, 4, 5]. Its mean value is 3. If you transform y with y_new = y - 3, you get y_new = [-2, -1, 0, 1, 2]. Now y_new has a mean value of 0.
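
In code, the centering approach looks roughly like this sketch, using the y vector from the example above:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_mean = y.mean()            # 3.0
y_centered = y - y_mean      # [-2., -1., 0., 1., 2.], which has mean 0

# Step 2 is to run ordinary ridge regression on (H, y_centered), penalizing all
# coefficients, and add y_mean back at prediction time:
# y_pred = H @ w_hat + y_mean
```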

Screenshot taken from Coursera 5:30

4-4) A brief recap

In summary, we've presented this concept of ridge regression, which is a regularized form of standard linear regression. It allows us to account for having lots and lots of features in a very straightforward way, both intuitively and algorithmically, as we've explored in this module.

And what ridge regression is allowing us to do is automatically perform this bias variance tradeoff. So we thought about how to perform ridge regression for a specific value of lambda, and then we talked about this method of cross validation in order to select the actual lambda we're gonna use for our models that we would use to make predictions.

In summary,

  • We've described why ridge regression might be a reasonable thing to do.
  • We motivated the magnitude penalty that ridge regression introduces: penalizing the magnitude of the coefficients makes sense because overfit models tend to have very large magnitude coefficients.
  • Then we talked about the actual ridge regression objective, and how it balances fit with the magnitude of these coefficients.
  • And we talked about how to fit the model both as a closed form solution as well as gradient descent.
  • And then how to choose our value of lambda using cross validation, a method that generalizes well beyond ridge regression, and indeed beyond regression.
  • And then finally, we talked about how to deal with the intercept term, if you wanna handle that specifically.

Screenshot taken from Coursera 1:00